getwd()
## [1] "/home/user3/workshop"
Raw data from Pezzini
et al 2017 was subjected to differential gene expression analysis
with Degust and the results
file saved to Pezzini_DE.txt.
The input data file is within the current working directory so we do not need to specify its directory path.
data <- read_tsv("Pezzini_DE.txt", col_names = TRUE, show_col_types = FALSE)
head(data)
The dataframe shows genes with fold change and FDR values, along with some normalised counts values for the 6 samples (2 groups with 3 replicates each).
Look on the environment pane of RStudio, and you can see a description ‘14420 obs. of 10 variables’ - this shows your dataframe consists of 10 columns and 14,420 genes.
Now we need to filter for differentially expressed genes (DEGs), and we will apply the thresholds adjusted P values/FDR < 0.01, and log2fold change of 2.
degs <- data %>%
filter(FDR < 0.01 & abs(Log2FC) > 2) %>%
pull(Gene.ID)
cat("Number of genes passing FDR and fold change filter:", length(degs), "\n")
## Number of genes passing FDR and fold change filter: 792
# Save the DEG gene list to disk:
write.table(degs, "Pezzini_DEGs.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)
We have 792 genes passing our filters.
Recall from the webinar and day 1 of the workshop that an experimental background gene list is crucial to avoiding false positives and minimising tissue bias with ORA.
The analysis in Degust has already removed lowly expressed genes, so we can simply extract all genes from this data matrix as our background gene list and save it as our ‘background’ object, as well as save to disk so that we can include it within the supplementary materials of any resultant publications for reproducibility.
background <- data$Gene.ID
cat("Number of background genes:", length(background), "\n")
## Number of background genes: 14420
# Save the background gene list to disk:
write.table(background, "Pezzini_background_genes.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)
gost functionBefore running the below code chunk, review the parameters. Observe
the similarities to what was available on the g:Profiler web interface,
for example organism, the correction method (g:Profiler’s custom
g_scs method), and domain scope (background genes).
Every R package has a user guide that outlines available parameters.
Open the gprofiler2
user guide and navigate to page detailing the gost function
(currently page 6).
Note: to open this link for the RStudio training VM, highlightthe link, right click, then select “Go to https://cran…”
Under the Usage subheading, all available parameters are
listed, along with their default settings. You can over-ride these
parameters by specifying your custom values in your R command.
Run the below code which explicitly includes all available
gost parameters:
ora <- gost(
degs,
organism = "hsapiens",
ordered_query = FALSE,
multi_query = FALSE,
significant = TRUE,
exclude_iea = FALSE,
measure_underrepresentation = FALSE,
evcodes = FALSE,
user_threshold = 0.05,
correction_method = "g_SCS",
domain_scope = "custom_annotated",
custom_bg = background,
numeric_ns = "",
sources = NULL,
as_short_link = FALSE,
highlight = FALSE
)
An error-free gost run should produce no console output.
Our results are saved in the R object ora.
Since we are using many of the default parameters, we could shorten this code to what is shown below. The results would be identical, however not as transparent as far as easily identifying what parameters were applied to a run.
Tip: you can run the command formals(gost) to print out
all parameters and their default values. This is a generic R function so
can be used on other tools outside of gprofiler2.
#ora <- gost(
# degs,
# correction_method = "g_SCS",
# domain_scope = "custom_annotated",
# custom_bg = background,
#)
formals(gost)
## $query
##
##
## $organism
## [1] "hsapiens"
##
## $ordered_query
## [1] FALSE
##
## $multi_query
## [1] FALSE
##
## $significant
## [1] TRUE
##
## $exclude_iea
## [1] FALSE
##
## $measure_underrepresentation
## [1] FALSE
##
## $evcodes
## [1] FALSE
##
## $user_threshold
## [1] 0.05
##
## $correction_method
## c("g_SCS", "bonferroni", "fdr", "false_discovery_rate", "gSCS",
## "analytical")
##
## $domain_scope
## c("annotated", "known", "custom", "custom_annotated")
##
## $custom_bg
## NULL
##
## $numeric_ns
## [1] ""
##
## $sources
## NULL
##
## $as_short_link
## [1] FALSE
##
## $highlight
## [1] FALSE
View the top-most significant enrichments with the R
head command. Only significant enrichments passing your
specified threshold (adjusted P value < 0.05) are included in the
results object.
head(ora$result)
Let’s give our query a name:
ora$result$query <- 'DEGs_Padj0.05_FC2'
head(ora$result)
Use the small black arrow near the column names to view columns wider
than the page width. The head view only shows the top 6
enrichments, which are all GO Biological Process.
We can obtain a list of all databases with significant enrichments from our query list:
unique(ora$result$source)
## [1] "GO:BP" "GO:CC" "GO:MF" "HP" "HPA" "KEGG" "MIRNA" "REAC" "TF"
## [10] "WP"
Same as the web tool, we have enrichment results for GP (BP, CC, MF), HP (human phenotype), HPA (human protein atlas), KEGG, MiRNA, Reactome, Transcription Factors, and WikiPathways.
Let’s save the results file to disk. First, we will re-order the columns so the output more closely matches the tables that are downloaded from the web version of the tool.
ora_reordered <- ora$result[, c("source", "term_name", "term_id", "p_value", "term_size", "query_size", "intersection_size", "effective_domain_size")]
head(ora_reordered)
write.csv(ora_reordered, "gprofiler_ORA_results.csv", row.names = FALSE)
The gostplot function creates a Manhattan plot similar
to the one shown on the web tool. By setting
interactive=TRUE we can hover over the data points to see
enriched term details.
The parameter capped = TRUE is an indicator whether the -log10(p-values) would be capped at 16 if bigger than 16. This fixes the scale of y-axis to keep Manhattan plots from different queries comparable and is also intuitive as, statistically, p-values smaller than that can all be summarised as highly significant.
gostplot(ora,
capped = TRUE,
interactive = TRUE,
pal = c(`GO:MF` = "#dc3912",
`GO:BP` = "#ff9900",
`GO:CC` = "#109618",
KEGG = "#dd4477",
REAC = "#3366cc",
WP = "#0099c6",
TF = "#5574a6",
MIRNA = "#22aa99",
HPA = "#6633cc",
CORUM = "#66aa00",
HP = "#990099")
)
There are a lot of significant enrichments for GO biological
processes. Many of these are probably terms containing a large number of
genes, so not particularly informative. Other R tools have default
settings limiting the minimum and maximum number of genes in a geneset
to be included in the analysis. Since there is no direct parameter to
restrict term size to gostplot, we can filter the ORA
results before plotting. Let’s apply a maximum gene set size of 500, and
a minimum gene set size of 10, which are the default setting used by
clusterProfiler.
# Filter the results for GO:BP terms with term_size <= 500
ora_filter_termsize <- ora
ora_filter_termsize$result <- ora$result %>% filter(term_size <= 500) %>% filter(term_size >= 10)
# Plot with gostplot using the filtered results
gostplot(ora_filter_termsize,
capped = TRUE,
interactive = TRUE,
pal = c(
`GO:MF` = "#dc3912",
`GO:BP` = "#ff9900",
`GO:CC` = "#109618",
KEGG = "#dd4477",
REAC = "#3366cc",
WP = "#0099c6",
TF = "#5574a6",
MIRNA = "#22aa99",
HPA = "#6633cc",
CORUM = "#66aa00",
HP = "#990099"
)
)
This has cleaned up ‘Biological Process’ a little bit, enabling signals of more specific terms to be highlighted.
gprofiler2 includes a function for creating a publication-ready image
that can optionally highlight specific terms. We need to first produce a
plot with interactice = FALSE, save it to an object, and
then provide that plot object to the publish_gostplot
function.
# Plot with gostplot using the filtered results
plot <- gostplot(ora_filter_termsize,
capped = TRUE,
interactive = FALSE,
pal = c(
`GO:MF` = "#dc3912",
`GO:BP` = "#ff9900",
`GO:CC` = "#109618",
KEGG = "#dd4477",
REAC = "#3366cc",
WP = "#0099c6",
TF = "#5574a6",
MIRNA = "#22aa99",
HPA = "#6633cc",
CORUM = "#66aa00",
HP = "#990099"
)
)
The publish_gostplot parameter
highlight_terms enables you to highlight specific terms on
the plot, with a table showing enrichment details below for those
highlighted terms.
Let’s highlight some selected terms manually. You need to provide the term ID not term name.
#Term IDs for 'Collagen degradation' and 'Collagen formation'
highlight <- c("REAC:R-HSA-1442490", "REAC:R-HSA-1474290")
filename <- "gostplot_filter_termsize_collagen.png"
publish_gostplot(plot,
highlight_terms = highlight,
filename,
width = 10,
height = 10 )
## The image is saved to gostplot_filter_termsize_collagen.png
You can use R grepl function to search for terms with
names matching some keyword. Let’s highlight all terms related to
receptors. The code chunk applies an increased figure height, to ensure
we can see the whole plot within the notebook.
# Filter terms containing "receptor" keyword and create a list of those term IDs
highlight <- ora$result %>%
filter(grepl("receptor", term_name, ignore.case = TRUE)) %>%
pull(term_id)
filename <- "gostplot_filter_termsize_receptors.png"
publish_gostplot(plot,
highlight_terms = highlight,
filename,
width = 10,
height = 10 )
## The image is saved to gostplot_filter_termsize_receptors.png
One of the advantages of working in R is flexibility with
visualisations. While the interactive Manhattan plots and
publish_gostplot options are nice, it can also be useful to
visualise P values aginst all term descriptions.
One way to do this is with a dotplot. We can loop through all
databases that gost has enriched against, and use the R
package ggplot2 to make a dotplot, first filtering the
results by database and(if desired) term size:
# List of databases
dbs <- unique(ora$result$source)
# Loop through each database and create a separate plo for each
for (db in dbs) {
#db_results <- ora$result
db_results <- ora$result %>% filter(source == db, term_size >= 10, term_size <= 500)
# Create a dot plot for the current database
p <- ggplot(db_results, aes(x = reorder(term_name, -p_value), y = -log10(p_value))) +
geom_point(aes(size = term_size, color = significant)) +
labs(title = paste(db),
x = "Term",
y = "-log10(p-value)") +
theme_minimal() +
coord_flip() # Flips the coordinates for better visibility
# Print the plot
print(p)
}
For plots with a lot of enriched terms, such as GO Biological Process, the display within the notebook is less than ideal. Saving the plot to an image file enables better resolution:
# Filter for the GO:BP database
go_bp_results <- ora$result %>% filter(source == "GO:BP",term_size >= 10, term_size <= 500)
# Create the plot for GO:BP
p_go_bp <- ggplot(go_bp_results, aes(x = reorder(term_name, -p_value), y = -log10(p_value))) +
geom_point(aes(size = term_size, color = significant)) +
labs(title = "GO:BP",
x = "Term",
y = "-log10(p-value)") +
theme_minimal() +
coord_flip() # Flips the coordinates for better visibility
# Open a PDF device to save the plot as a full size A4:
pdf("gprofiler_GO_BP_dotplot.pdf", width = 8.27, height = 11.69) # A4 portrait size in inches
# Print the plot to the device
print(p_go_bp)
# Close the device (this saves the plot)
dev.off()
## png
## 2
Open this plot by clicking it from the ‘Files’ pane of RStudio. Notice how the term names are now readable :-)
By providing more than one gene list and setting
multi_query = TRUE, results from all of the gene lists are
grouped by term IDs for easier comparison. This can be handy when you
have multiple comparisons within an experiment, or when you want to
investigate enrichments within the up and down regulated genes
separately.
First, we need to re-extract our gene list so we have separate DEGs for up-regulated and down-regulated genes.
up_degs <- data %>%
filter(FDR < 0.01 & Log2FC >= 2) %>%
pull(Gene.ID)
cat("Number of upregulated DEGs:", length(up_degs), "\n")
## Number of upregulated DEGs: 577
# Save the DEG gene list to disk:
write.table(up_degs, "up_DEGs.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)
down_degs <- data %>%
filter(FDR < 0.01 & Log2FC <= -2) %>%
pull(Gene.ID)
cat("Number of downregulated DEGs:", length(down_degs), "\n")
## Number of downregulated DEGs: 215
# Save the DEG gene list to disk:
write.table(down_degs, "down_DEGs.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)
Now run gost as multi-query. The changes required for
multi-query are providing a list of gene list objects to the
query parameter instead of a single gene list object, and
setting multi_query = TRUE, which is FALSE by default. By
including all genes as well, we can efficintly compare up vs down vs no
separation.
ora_multi <- gost(
query = list("upregulated" = up_degs, "downregulated" = down_degs, "all_DEGs" = degs),
organism = "hsapiens",
ordered_query = FALSE,
multi_query = TRUE,
significant = TRUE,
exclude_iea = FALSE,
measure_underrepresentation = FALSE,
evcodes = FALSE,
user_threshold = 0.05,
correction_method = "g_SCS",
domain_scope = "custom_annotated",
custom_bg = background,
numeric_ns = "",
sources = NULL,
as_short_link = FALSE,
highlight = FALSE
)
Now create a multi-query interactive Manhattan plot with
gostplot:
gostplot(ora_multi, capped = TRUE, interactive = TRUE)
Unfortunately, notebook view squashes the top plot over the bottom one, and adjusting figure height or plot layout options doesn’t seem to help. Plotting as non-interactive or plotting from the console to the plots pane both produce a correct looking plot.
p <- gostplot(ora_multi, capped = TRUE, interactive = FALSE)
filename <- "gprofiler_ORA_multiquery.png"
publish_gostplot(p,
highlight_terms = NULL,
filename,
width = 10,
height = 10 )
## The image is saved to gprofiler_ORA_multiquery.png
To access the tabular results separately, they need to be split, as the default results view
NOT WORKING
NR <- nrow(ora$result)
NS <- ifelse(NR < 10, NR, 10)
termSource <- sample(nrow(ora$result), NS)
VEC <- ora$result[termSource, "term_id"]
publish_gosttable(ora,
highlight_terms = VEC,
use_colors = TRUE,
show_columns = c("source", "term_name", "term_size", "intersection_size"),
filename = NULL)
Head ora multi in notebook:
head(ora_multi$result)
NR <- nrow(ora_multi$result)
NS <- ifelse(NR < 10, NR, 10)
termSource <- sample(nrow(ora_multi$result), NS)
VEC <- ora_multi$result[termSource, "term_id"]
filename <- "publish-gosttable-multi.pdf"
publish_gosttable(ora,
highlight_terms = VEC,
use_colors = TRUE,
show_columns = c("source", "term_name", "term_size", "intersection_size", "p_values", "significant"),
filename)
## The image is saved to publish-gosttable-multi.pdf
END NOT WORKING
In day 1 of the workshop, you ran ORA with g:Profiler web tool and saved the results to a CSV. Let’s compare the results to those we have generated in R. Do we expect the results to be identical or differ slightly?
The input file here is one that we have created, but should match yours as long as you used the same P filters and gprofiler parameters.
web <- read.csv("gProfiler_hsapiens_07-11-2024_11-27-09__intersections.csv")
Check the numbers: are there any terms significant frm one tool but not the other?
# Extract significant term names
web_terms <- web$term_name
ora_terms <- ora$result$term_name
paste0("Number of significant terms from web: ", length(web_terms))
## [1] "Number of significant terms from web: 273"
paste0("Number of significant terms from R: ", length(ora_terms))
## [1] "Number of significant terms from R: 273"
# Find command and unique terms
common_terms <- intersect(web_terms, ora_terms)
if (length(common_terms) == length(web_terms) && length(common_terms) == length(ora_terms)) {
# If the lengths match, all terms are shared
print("All terms are shared")
} else {
# If there are differences, report the number of terms
unique_web <- setdiff(web_terms, ora_terms)
unique_ora <- setdiff(ora_terms, web_terms)
print(paste("Number of terms unique to web:", length(unique_web)))
print(paste("Number of terms unique to gprofiler2 (R):", length(unique_ora)))
}
## [1] "All terms are shared"
That’s a good start! Do the P values differ? Let’s look closely at GO ‘Molecular Function’.
# Filter for GO:MF terms
go_mf_web <- web %>% filter(source == "GO:MF")
go_mf_r <- ora$result %>% filter(source == "GO:MF")
# Extract term names and p values
comparison_data_go_mf <- data.frame(
term_name = go_mf_web$term_name,
p_value_web = go_mf_web$adjusted_p_value,
p_value_r = go_mf_r$p_value
)
# Reshape the data to long format
comparison_data_long <- comparison_data_go_mf %>%
pivot_longer(cols = starts_with("p_value"),
names_to = "source",
values_to = "p_value")
# Create the bar plot with -log10 transformed p-values
print(ggplot(comparison_data_long, aes(x = term_name, y = -log10(p_value), fill = source)) +
geom_bar(stat = "identity", position = "dodge", width = 0.7) + # Side-by-side bars
labs(title = "Adjusted P value comparison for GO:MF enrichments",
x = "Term Name",
y = "-log10(P-value)") +
scale_fill_manual(values = c("p_value_web" = "#ff9900", "p_value_r" = "#3366cc")) + # Custom colors
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) )
Great! We know we applied the same parameters, and used the same input gene lists. The identical results must mean that gprofiler2 is using the same database version as g:Profiler web.
Let’s check: yesterday when you ran ORA on the web, hopefully you saved your ‘query parameters’ as well as your results.
From my run, I can see version as ‘e111_eg58_p18_f463989d’.
Let’s report the g:Profiler database version used in our analysis:
paste0("g:Profiler database version: ", ora$meta$version)
## [1] "g:Profiler database version: e111_eg58_p18_f463989d"
paste0("gprofiler2 package version: ", packageVersion("gprofiler2"))
## [1] "gprofiler2 package version: 0.2.3"
We can also capture the version of R and other session details
including all loaded packages and versions with the
sessionInfo() function:
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
## [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
## [7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tidyr_1.3.1 ggplot2_3.5.1 gprofiler2_0.2.3 dplyr_1.1.4
## [5] readr_2.1.5
##
## loaded via a namespace (and not attached):
## [1] plotly_4.10.4 sass_0.4.9 utf8_1.2.4 generics_0.1.3
## [5] bitops_1.0-7 hms_1.1.3 digest_0.6.35 magrittr_2.0.3
## [9] evaluate_0.23 grid_4.4.2 fastmap_1.1.1 jsonlite_1.8.8
## [13] gridExtra_2.3 promises_1.3.0 httr_1.4.7 purrr_1.0.2
## [17] fansi_1.0.6 crosstalk_1.2.1 viridisLite_0.4.2 scales_1.3.0
## [21] textshaping_0.3.7 lazyeval_0.2.2 jquerylib_0.1.4 shiny_1.8.1.1
## [25] cli_3.6.2 rlang_1.1.3 crayon_1.5.2 bit64_4.0.5
## [29] munsell_0.5.1 withr_3.0.0 cachem_1.0.8 yaml_2.3.8
## [33] tools_4.4.2 parallel_4.4.2 tzdb_0.4.0 colorspace_2.1-0
## [37] httpuv_1.6.15 mime_0.12 vctrs_0.6.5 R6_2.5.1
## [41] lifecycle_1.0.4 htmlwidgets_1.6.4 bit_4.0.5 vroom_1.6.5
## [45] ragg_1.3.1 pkgconfig_2.0.3 later_1.3.2 pillar_1.9.0
## [49] bslib_0.7.0 gtable_0.3.5 Rcpp_1.0.12 data.table_1.15.4
## [53] glue_1.7.0 systemfonts_1.0.6 highr_0.10 xfun_0.43
## [57] tibble_3.2.1 tidyselect_1.2.1 rstudioapi_0.16.0 knitr_1.46
## [61] farver_2.1.2 xtable_1.8-4 htmltools_0.5.8.1 labeling_0.4.3
## [65] rmarkdown_2.26 compiler_4.4.2 RCurl_1.98-1.14
And finally, the RStudio version. Typically, we would simply run
RStudio.Version() to print the version details. However,
when we knit this document to HTML, the output is not available and will
cause an error. So to make sure our version details are saved to our
static record of the work, we will save to a file, then print the file
contents back into the notebook.
# Get RStudio version information
rstudio_info <- RStudio.Version()
# Convert the version information to a string
rstudio_version_str <- paste(
"RStudio Version Information:\n",
"Version: ", rstudio_info$version, "\n",
"Release Name: ", rstudio_info$release_name, "\n",
"Long Version: ", rstudio_info$long_version, "\n",
"Mode: ", rstudio_info$mode, "\n",
"Citation: ", rstudio_info$citation,
sep = ""
)
# Write the output to a text file
writeLines(rstudio_version_str, "rstudio_version.txt")
# Read the saved version information from the file
rstudio_version_text <- readLines("rstudio_version.txt")
# Print the version information to the document
rstudio_version_text
## [1] "RStudio Version Information:"
## [2] "Version: 2023.6.1.524"
## [3] "Release Name: Mountain Hydrangea"
## [4] "Long Version: 2023.06.1+524"
## [5] "Mode: server"
## [6] "Citation: list(title = \"RStudio: Integrated Development Environment for R\", author = list(list(given = \"Posit team\", family = NULL, role = NULL, email = NULL, comment = NULL)), organization = \"Posit Software, PBC\", address = \"Boston, MA\", year = \"2023\", url = \"http://www.posit.co/\")"
The last task is to knit the notebook. Our notebook is
editable, and can be changed. Deleting code deletes the output, so we
could lose valuable details. If we knit the notebook to HTML, we have a
permanent static copy of the work.
On the editor pane toolbar, under ‘Preview’, select ‘Knit to HTML’.
If you have already run ‘Preview’, you will see ‘Knit’ instead of ‘Preview’.